CPSC 545/445 (Autumn 2003) - Class 11: Gene Finding (1)
Module 4, Part 1


---
4.1 Gene finding - background / motivation

increasing amount of genomic sequence data
-> interpretation of this data lagging behind.

for genomic DNA sequence data from higher eukaryotes:
computational gene finding, i.e., 
identifying intron/exon structures, 
is one of the main problems of bioinformatics.

the problem is closely related to fundamental
biochemical issues of specifying the precise
determinants of:
- transcription
- translation
- RNA splicing

the problem is also of significant practical
importance, as computer software for gene finding
is routinely used by genome sequencing laboratories
to help identify genes in newly sequenced regions.

--
recap: gene structure

- genes of most eukaryotes are neither continuous nor contiguous:
  - they are separated by long stretches of intergenic DNA
  - their coding sequences are interrupted by non-coding introns
- only a small part of the genome is coding sequence
  (human: ca. 3%)
- alternative splicing (at least 35% of human genes - Mironov et al., 1999)
- nested genes (Dunham et al., 1999)
- overlapping genes on the same or opposite strands
  (Schulz and Butler, 1989; Ashburner et al., 1999; Cooper et al., 1998)
- pseudogenes = non-functional sequences resembling real genes
  occur in numerous copies throughout the genome

regulatory regions are crucial for gene expression
their location relative to target gene is not uniquely determined:
- basic regulatory elements (e.g., TATA / CAAT boxes) usually upstream
	of transcription start site
- enhancers and silencers can be distant upstream, downstream,
	even within introns


---
4.2 Approaches to computational gene finding


the computational gene finding problem

given raw sequence data, predict:
- coding and non-coding regions
- exons/introns
- splicing patterns
- transcription factor binding sites
- ...

-> genomic sequence annotation

[slide: Figure 1]


--
naive approach:
- search for characteristic subsequences (pattern matching)
  (e.g., TATA, CCAAT, GC boxes, etc.)

example: intron identification (GT-AG rule)

[slide: Figure 31.2]


problems:
- only covers most conserved signals (subsequences)
- these are not sufficient for characterising genes (exons)
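
sketch (not part of the original notes): the GT-AG rule applied as plain
pattern matching, in python. the function name candidate_introns and the
min_len cutoff are hypothetical choices for illustration. every GT is paired
with every later AG, which mostly shows how badly the bare rule over-predicts.

```python
import re

def candidate_introns(seq, min_len=4):
    """Return (start, end) index pairs such that seq[start:end]
    begins with GT and ends with AG (the GT-AG rule, nothing more)."""
    seq = seq.upper()
    # start index of every GT dinucleotide (possible 5' splice site)
    donors = [m.start() for m in re.finditer("GT", seq)]
    # index just past every AG dinucleotide (possible 3' splice site)
    acceptors = [m.start() + 2 for m in re.finditer("AG", seq)]
    return [(d, a) for d in donors for a in acceptors if a - d >= min_len]

# toy sequence (made up): 14 bases already give three candidates
print(candidate_introns("CCGTAAGTTTAGCC"))  # [(2, 7), (2, 12), (6, 12)]
```

on real genomic sequence the number of candidates grows roughly with the
product of GT and AG counts, so the rule alone cannot identify introns.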

--
ideal approach:
- complete simulation of DNA transcription, 
  RNA splicing/processing, RNA translation
- base gene prediction on results of this simulation

problems:
- presupposes solutions to major open biochemical problems
- computational complexity might be very high

--
finding open reading frames (ORFs):
- mark all stop codons in all reading frames
- long stretches of uninterrupted sequence between stop codons
  give candidates for genes (exons)
- for prokaryotes, also use initiation codon ATG

[slide: Figure 7-15]

problems:
- distribution of ORF length almost identical for
  random sequence data and real genomic sequence
  (Claverie et al., 1997)
-> ORF length alone gives almost no information on protein 
  coding regions
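
sketch (not part of the original notes): marking stop codons and reporting
the stop-free stretches between them, in python. only the three forward
reading frames are scanned; the reverse strand would be handled by running
the same function on the reverse complement. the name open_frames and the
min_codons cutoff are hypothetical.

```python
STOPS = {"TAA", "TAG", "TGA"}

def open_frames(seq, min_codons=2):
    """Return (frame, start, end) for each stop-free stretch seq[start:end]
    of at least min_codons codons, in the three forward reading frames."""
    seq = seq.upper()
    found = []
    for frame in range(3):
        start = frame
        # step through complete codons of this frame
        for i in range(frame, len(seq) - 2, 3):
            if seq[i:i + 3] in STOPS:
                if (i - start) // 3 >= min_codons:
                    found.append((frame, start, i))
                start = i + 3  # next stretch begins after the stop codon
        # trailing stretch that is not closed by a stop codon
        end = frame + ((len(seq) - frame) // 3) * 3
        if (end - start) // 3 >= min_codons:
            found.append((frame, start, end))
    return found

# toy sequence (made up)
for frame, start, end in open_frames("ATGAAATAGCCCGGGTTTTAA"):
    print(frame, start, end)
```

as noted above, the length of such stretches alone says little about
protein coding regions, so this is at best a first filter.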

--
how is it really done?

- signal detection (splice sites, promoter regions, etc.)

  note: signals are often not 100% conserved
  example: intron 5' splice site
  -> probabilistic models (as in phylogeny!)

[slide: Figure 7-15]
[slide: Figure 1.8 - intron 5' splice site]

- compositional properties of coding vs. non-coding DNA
  (GC content, hexamer frequencies)

- integration of the above
- integration with homology searching

-> modern gene finders predict individual functional elements
  and entire gene structure (sets of spliceable exons)
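
sketch (not part of the original notes): the two compositional properties
named above, GC content and hexamer frequencies, computed in python. the
function names are hypothetical. how the resulting numbers are turned into
a coding / non-coding decision (e.g., by comparing against counts from
training sets) is not shown here.

```python
from collections import Counter

def gc_content(seq):
    """Fraction of G and C nucleotides in seq (0.0 for an empty sequence)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def kmer_counts(seq, k=6):
    """Count all overlapping substrings of length k (k=6: hexamers)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))

# toy sequences (made up)
print(gc_content("GGCCAATT"))                   # 0.5
print(kmer_counts("ACGTACGTAC").most_common(1))  # [('ACGTAC', 2)]
```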

--
signal detection:

- simple motif search: typically not good enough
- Weight Matrix Method (WMM)
- Weight Array Method (WAM)
- Hidden Markov Models (HMMs)

--
fundamental method:

- build generative, probabilistic models of signals
- use these to compute probability for given sequence
- high probability of generation -> predict signal


---
4.3 Weight Matrix and Weight Array Methods


WMM method:

given: 
- frequency p^i_j of nucleotide j at position i
- sequence X=x1 x2 x3 ... xn

probability P{X} of generating X (positions treated as independent)
= p^1_x1 * p^2_x2 * ... * p^n_xn
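
sketch (not part of the original notes): the product above in python.
the matrix is stored as one dict per position; the name wmm_probability
and the numbers in the toy matrix are made up for illustration.

```python
def wmm_probability(pwm, seq):
    """P{X} under a weight matrix: pwm[i][j] is the frequency of
    nucleotide j at position i; seq must have one base per position."""
    if len(seq) != len(pwm):
        raise ValueError("sequence length must equal number of positions")
    p = 1.0
    for column, base in zip(pwm, seq.upper()):
        p *= column[base]
    return p

# toy 3-position matrix (made-up frequencies, each column sums to 1)
toy_pwm = [
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.5, "C": 0.1, "G": 0.3, "T": 0.1},
]
print(wmm_probability(toy_pwm, "GTA"))  # 0.7 * 0.7 * 0.5, about 0.245
print(wmm_probability(toy_pwm, "CAT"))  # 0.1 * 0.1 * 0.1, about 0.001
```

for long signals the product underflows quickly, so implementations
commonly sum log-probabilities instead; that variant is not shown here.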


--
generalisation:

WAM: models dependencies between adjacent positions
(probability of seeing x2 at pos 2 depends on what was
 seen at pos 1)
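
sketch (not part of the original notes): a first-order version of this
idea in python, i.e.
P{X} = p^1_x1 * p^2_(x2|x1) * ... * p^n_(xn|xn-1).
the data layout (first, cond) and the name wam_probability are
hypothetical, and the numbers are made up.

```python
def wam_probability(first, cond, seq):
    """first[b]: probability of base b at position 1.
    cond[k][prev][b]: probability of base b at position k+2,
    given base prev at position k+1 (positions counted from 1)."""
    seq = seq.upper()
    if len(seq) != len(cond) + 1:
        raise ValueError("sequence length must equal number of positions")
    p = first[seq[0]]
    for k in range(1, len(seq)):
        p *= cond[k - 1][seq[k - 1]][seq[k]]
    return p

# toy 2-position model: after a G, a T is very likely (made-up numbers)
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
toy_first = {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1}
toy_cond = [{
    "A": uniform, "C": uniform, "T": uniform,
    "G": {"A": 0.05, "C": 0.05, "G": 0.1, "T": 0.8},
}]
print(wam_probability(toy_first, toy_cond, "GT"))  # 0.7 * 0.8, about 0.56
print(wam_probability(toy_first, toy_cond, "AT"))  # 0.1 * 0.25, about 0.025
```

a WMM cannot express that T at position 2 is likely only after G at
position 1; here that dependency is carried by the conditional table.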


--
where do the model parameters (p^i_j) come from?

- manual determination (human experience)
- training: learned from aligned sequence data of known signals

(problem: bias towards known data, 
	poor performance for unknown sequence data)
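
sketch (not part of the original notes): estimating the p^i_j by counting
nucleotides column by column in aligned examples of a signal, in python.
the pseudocount (default 1) is an added assumption, not something the
notes prescribe; it keeps nucleotides that never occur in the training
data from getting probability 0. the name train_wmm and the training
sequences are made up.

```python
def train_wmm(aligned, pseudocount=1.0, alphabet="ACGT"):
    """Estimate a weight matrix from equal-length aligned sequences.
    Returns a list with one dict per position: base -> frequency."""
    n = len(aligned[0])
    if any(len(s) != n for s in aligned):
        raise ValueError("all training sequences must have the same length")
    pwm = []
    for i in range(n):
        counts = {b: pseudocount for b in alphabet}
        for s in aligned:
            counts[s[i].upper()] += 1  # KeyError on symbols outside alphabet
        total = sum(counts.values())
        pwm.append({b: c / total for b, c in counts.items()})
    return pwm

# toy training set (made up): four aligned 3-base sites
sites = ["GTA", "GTG", "GTA", "GCA"]
model = train_wmm(sites)
print(model[0]["G"])  # (4 + 1) / (4 + 4) = 0.625
print(model[1]["T"])  # (3 + 1) / (4 + 4) = 0.5
```

with only four training sites the pseudocounts dominate the estimate,
which is one concrete form of the bias problem mentioned above.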


--
note: WMM/WAM can still be too weak to reliably detect signals
	such as intron/exon boundaries

[slide: Fig. 1]


---
Resources:

D. Haussler. Computational Genefinding. 
	[online, from his webpage: www.cse.ucsc.edu/~haussler]

J.-M. Claverie, O. Poirot, F. Lopez. The difficulty of identifying genes
	in anonymous vertebrate sequences.
	Computers Chem. 21(4): 203-214, 1997.

J.W. Fickett. The gene identification problem: an overview for developers.
	Computers Chem. 20(1): 103-118, 1996.

R. Guigo. Computational gene identification: an open problem.
	Computers Chem. 21(4): 215-222, 1997.

---